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Abstract 

Statistical methods development for differential expression analysis of RNA se- 
quencing (RNA-seq) requires software tools to assess accuracy and error rate control. 
Since true differential expression status is often unknown in experimental datasets, 
artificially-constructed datasets must be utilized, either by generating costly spike-in 
experiments or by simulating RNA-seq data. Polyester is an R package designed to 
simulate RNA-seq data, beginning with an experimental design and ending with col- 
lections of RNA-seq reads. The main advantage of Polyester is the ability to simulate 
isoform-level differential expression across biological replicates for a variety of experi- 
mental designs at the read level. Differential expression signal can be simulated with 
either built-in or user-defined statistical models. Polyester is available on GitHub at 
https: / / github.com/alyssafrazee/polyester. 



1 Introduction 

RNA sequencing (RNA-seq) experiments have become increasingly popular as a means to 
study gene expression. There are a range of statistical methods for differential expression 
analysis of RNA-seq data [6]. The developers of statistical methodology for RNA-seq need to 
test whether their tools are performing correctly. Often, accuracy tests cannot be performed 
on real datasets because true gene expression levels and expression differences between pop- 
ulations are usually unknown, and spike-in experiments are costly in terms of both time and 
money. 
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Instead, researchers often use computational simulations to create datasets with a known 
underlying truth. Polyester is a new R package for simulating RNA-seq reads. Its main 
advantage is its built-in functionality for inducing differential expression into the simulated 
experiment at the gene or isoform level at the sequencing step of the RNA-seq pipeline. In 
the literature, simulated expression measurements used to evaluate differential expression 
tools are usually generated directly from a statistical model (e.g. [7], [1]), but these simu- 
lated scenarios do not account for variability in expression measurements that arises during 
upstream steps in RNA-seq data analysis, such as read alignment. 

Many existing RNA-seq read simulators were not designed for directly simulating experi- 
ments with biological replicates and differential expression. For example, the rsem-simulate-reads 
utility shipped with RSEM [5] requires aligning real sequencing reads to develop a sequencing 
model before simulating reads and differential expression simulation is not built-in. Neither 
FluxSimulator [4] nor BEERS [3] have a built-in mechanism for inducing differential expres- 
sion, nor do they provide a way to define a model for feature expression measurements across 
replicates. TuxSim has been used to simulate RNA-seq datasets with differential expression 
[8], but it is not publicly available. 

Polyester was created to fulfill the need for a tool to simulate RNA-seq reads for an 
experiment with replicates and well-defined differential expression. Users can easily simulate 
small experiments from a few genes or a single chromosome. This can reduce computational 
time in simulation studies when computationally intensive steps such as read alignment must 
be performed as part of the simulation. Polyester is open-source, cross-platform, and freely 
available for download at https://github.com/alyssafrazee/polyester. 

2 Methods 

Polyester takes annotated transcript sequences as input. These can be provided as cDNA 
sequences FASTA format, or as a GTF file with gene annotations along with full-chromosome 
DNA sequences. The number of reads for each gene or transcript can be set using Polyesters 
built-in model for differential expression. The built-in transcript abundance model assumes 
that the distribution of the number of reads to simulate from each transcript is negative 
binomial, across biological replicates. The negative binomial model for read counts has been 
shown to satisfactorily capture biological and technical variability ([1], [7]). 

Specifically, define Yij}. as the number of reads simulated from replicate i, experimental 
condition j, and transcript k (i — 1, rij, j = 1, 2, and k = 1, N, where rij is the number 
of replicates in condition j and N is the total number of transcripts provided). Polyester 
assumes Yy* ~ NegativeBinomial(mean = /ij^, dispersion = r^). In this negative binomial 

parameterization, E\Y^^) = [i^ and Var(Yyfc) = y,^ H — — , so each transcript's expression 
variance is quadratically related to its baseline mean expression. The user can provide fijk 
for each transcript k and experimental group j. In particular, the user can relate transcript 
k's length to /i^. 

By default, = which means Var(Yyfc) = ifijk- The user can adjust on a per- 
transcript basis as needed, to explore different mean/variance expression models. Differential 
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expression can be set by providing a fold change for each transcript. Initially, a baseline mean 
fMk is provided for each transcript, and fiik and fi2k are set to /i^. Then, if fold change A is 
provided, [i\k and n<ik are adjusted: if A > 1, = A/i^, and if A < 1, fi2k = \l*k- 

Alternatively, polyester allows users to individually specify the number of reads to gener- 
ate from each transcript, for each sample. This provides the capability to simulate complex 
experimental designs, such as timecourse experiments or multi-group studies, and to explore 
the effects of a wide variety of experimental parameters on differential expression results. 

After the transcripts have been defined and each transcript's abundance in the simulated 
experiment has been determined, polyester simulates the RNA sequencing process, described 
in detail in [6], beginning at the fragmentation step: first, all transcripts present are broken 
into short fragments. Fragment lengths are drawn from a normal distribution, with mean 
Hfi and standard deviation c/j. By default, Hfi = 250 nucleotides and = 25. Fragment 
locations are chosen at random. Polyester simulates reads from an unstranded protocol, so 
each fragment is reverse-complemented with probability 0.5: it is equally likely that a cDNA 
fragment originated from either the Watson or Crick strand. 

After fragmentation is the sequencing step: the first R nucleotides are read from each 
fragment. If paired-end reads are desired, the last R nucleotides are also read from that 
fragment. The default read length is R — 100. Polyester assumes a uniform sequencing 
error model, so an incorrect nucleotide is read from the fragment with probability p e (default 
0.005). Simulated reads are automatically written to FASTA files on disk. The FASTA files 
identify which reads originated from which transcripts, facilitating assessment of downstream 
alignment accuracy. 

3 Example Analysis 

To demonstrate a use case for polyester, we simulated a small differential expression ex- 
periment and tested the accuracy of Ballgown's statistical models [2] at detection of this 
differential expression. The built-in read count model was used, and for each transcript 
k on human chromosome 22 (hgl9 build, N = 918), /i^ was set to length(transcript fc )/5, 
which corresponds to approximately 20x coverage for 100-bp reads. We randomly chose 75 
transcripts to have A = 3 and 75 to have A = 1/3; the rest had A = 1. For nj = 7 replicates 
in each group j, we simulated paired-end reads from 250-base fragments (o~fi = 25), with 
error probability 0.005. Simulated reads were aligned to hgl9 with TopHat 2.0.11 [9], and 
Cufflinks 2.2.1 [10] was used to obtain transcript-specific expression estimates. We then ran 
transcript- level differential expression tests using Ballgown [2], and by matching each Cuf- 
flinks transcript to the closest transcript in the annotation set used to generate the reads, 
we were able to examine Ballgown's accuracy using an ROC curve (Figure 1). 

This example is just one of many ways polyester can be used to explore the effects of 
analysis choices on differential expression results. Data and scripts used in the example 
analysis are available at https://github.com/alyssafrazee/polyester_code. 
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Figure 1: ROC curve for detecting transcript- level differential expression simulated with 
polyester. The pre-set differential expression was accurately detected using Ballgown's dif- 
ferential expression tests. 
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